Abstract. We show that, given a string s of length n, with constant
memory and logarithmic passes over a constant number of streams we
can build a context-free grammar that generates s and only s and whose
size is within an $O(\min(g \log g, \sqrt{n/\log n}))$-factor of the minimum g.

This stands in contrast to our previous result that, with polylogarithmic 
memory and polylogarithmic passes over a single stream, we cannot build 
such a grammar whose size is within any polynomial of g. 



1 Introduction 

In the past decade, the ever-increasing amount of data to be stored and ma- 
nipulated has inspired intense interest in both grammar-based compression and 
streaming algorithms, resulting in many practical algorithms and upper and 
lower bounds for both problems. Nevertheless, there has been relatively little 
study of grammar-based compression in a streaming model. In a previous paper [1]
we proved limits on the quality of the compression we can achieve with
polylogarithmic memory and polylogarithmic passes over a single stream. In this 
paper we show how to achieve better compression with constant memory and 
logarithmic passes over a constant number of streams. 

For grammar-based compression of a string s of length n, we try to build a 
small context-free grammar (CFG) that generates s and only s. This is useful 
not only for compression but also for, e.g., indexing [2,3] and speeding up dynamic
programs [3]. (It is sometimes desirable for the CFG to be in Chomsky
normal form (CNF), in which case it is also known as a straight-line program.) 
We can measure our success in terms of universality [5], empirical entropy [6] or
the ratio between the size of our CFG and the size g = Ω(log n) of the smallest
such grammar. In this paper we consider the third and last measure. Storer and
Szymanski [7] showed that determining the size of the smallest grammar is NP-
complete; Charikar et al. [8] showed it cannot be approximated to within a small 
constant factor in polynomial time unless P = NP, and that even approximating
it to within a factor of o(log n / log log n) in polynomial time would require
progress on a well-studied algebraic problem. Charikar et al. and Rytter [9] in- 
dependently gave O(log(n/g))-approximation algorithms, both based on turning 
the LZ77 [10] parse of s into a CFG and both initially presented at conference 
in 2002; Sakamoto [11] then proposed another O(log(n/g))-approximation algorithm,
based on Re-Pair [13]. Sakamoto, Kida and Shimozono [12] gave a linear-time
O((log g)(log n))-approximation algorithm that uses O(g log g) workspace,
again based on LZ77; together with Maruyama, they [14] recently modified their
algorithm to run in O(n log* n) time but achieve an O((log n)(log* n)) approximation
ratio.

A few years before Charikar et al.'s and Rytter's papers sparked a surge of interest
in grammar-based compression, a paper by Alon, Matias and Szegedy [15]
did the same for streaming algorithms. We refer the reader to Babcock et al. [16] 
and Muthukrishnan [17] for a thorough introduction to streaming algorithms. In 
this paper, however, we are most concerned with more powerful streaming mod- 
els than these authors consider, ones that allow the use of multiple streams. A 
number of recent papers have considered such models, beginning with Grohe and
Schweikardt's [18] definition of (r, s, t)-bounded Turing Machines, which use at
most r reversals over t "external-memory" tapes and a total of at most s space on
"internal-memory" tapes to which they have unrestricted access. While Munro
and Paterson [19] proved tight bounds for sorting with one tape three decades
ago, Grohe and Schweikardt proved the first tight bounds for sorting with multiple
tapes. Grohe, Hernich and Schweikardt [20] proved lower bounds in this
model for randomized algorithms with one-sided error, and Beame, Jayram and
Rudra [21] proved lower bounds for algorithms with two-sided error (renaming
the model "read/write streams"). Beame and Huynh [22] revisited the problem
considered by Alon, Matias and Szegedy, i.e., approximating frequency moments,
and proved lower bounds for read/write stream algorithms. Hernich and
Schweikardt [23] related results for read/write stream algorithms to results in
classical complexity theory, including results by Chen and Yap [24] on reversal 
complexity. Hernich and Schweikardt's paper drew our attention to a theorem 
by Chen and Yap implying that, if a problem can be solved deterministically 
with read-only access to the input and logarithmic workspace then, in theory, 
it can be solved with constant memory and logarithmic passes (in either direc- 
tion) over a constant number of read/write streams. This theorem is the key 
to our main result in this paper. Unfortunately, the constants involved in Chen 
and Yap's construction are enormous; we leave as future work finding a more 
practical proof of our results. 

The study of compression in something like a streaming model goes back 
at least a decade, to work by Sheinwald, Lempel and Ziv [25] and De Agostino 
and Storer [26]. As far as we know, however, our joint paper with Manzini [27]
was the first to give nearly tight bounds in a standard streaming model. In that
paper we proved nearly matching bounds on the compression achievable with
a constant amount of internal memory and one pass over the input, as well as
upper and lower bounds for LZ77 with a sliding window whose size grows as
a function of the number of characters encoded. (Our upper bound for LZ77
used a theorem due to Kosaraju and Manzini [28] about quasi-distinct parsings,
a subject recently revisited by Amir, Aumann, Levy and Roshko [29].)
Shortly thereafter, Albert, Mayordomo, Moser and Perifel [30] showed the compression
achieved by LZ78 [31] is incomparable to that achievable by pushdown
transducers; Mayordomo and Moser then extended their result to show both
kinds of compression are incomparable with that achievable by online algorithms
with polylogarithmic memory. (A somewhat similar subject, recognition of the
context-sensitive Dyck languages in a streaming model, was recently broached by
Magniez, Mathieu and Nayak [33], who gave a one-pass algorithm with one-sided
error that uses polylogarithmic time per character and $O(n^{1/2} \log n)$ space.) In
a recent paper with Ferragina and Manzini [34] we demonstrated the practicality
of streaming algorithms for compression in external memory. 

In a recent paper [1] we proved several lower bounds for compression al- 
gorithms that use a single stream, all based on an automata-theoretic lemma: 
suppose a machine implements a lossless compression algorithm using sequen- 
tial accesses to a single tape that initially holds the input; then we can recon- 
struct any substring given, for every pass, the machine's configurations when 
it reaches and leaves the part of the tape that initially holds that substring, 
together with all the output it generates while over that part. (We note similar 
arguments appear in computational complexity, where they are referred to as 
"crossing sequences", and in communication complexity.) It follows that, if a 
streaming compression algorithm is restricted to using polylogarithmic memory 
and polylogarithmic passes over one stream, then there are periodic strings with 
polylogarithmic periods such that, even though the strings are very compressible 
as, e.g., CFGs, the algorithm must encode them using a linear number of bits; 
therefore, no such algorithm can approximate the smallest-grammar problem to 
within any polynomial of the minimum size. Such arguments cannot prove lower 
bounds for algorithms with multiple streams, however, and we left open the ques- 
tion of whether extra streams allow us to achieve a polynomial approximation. 
In this paper we use Chen and Yap's result to confirm they do: we show how, 
with logarithmic workspace, we can compute the LZ77 parse and turn that into a
CFG in CNF while increasing the size by a factor of $O(\min(g \log g, \sqrt{n/\log n}))$,
i.e., at most polynomially in g. It follows that we can achieve that approximation
ratio while using constant memory and logarithmic passes over a constant
number of streams.





2 LZ77 in a Streaming Model 



Our starting point is the same as that of Charikar et al., Rytter and Sakamoto, 
Kida and Shimozono, but we pay even more attention to workspace than the last
set of authors. Specifically, we begin by considering the variant of LZ77 considered
by Charikar et al. and Rytter, which does not allow self-referencing phrases
but still produces a parse whose size is at most as large as that of the smallest 
grammar. Each phrase in this parse is either a single character or a substring of 
the prefix of s already parsed. For example, the parse of "how-much-wood-would- 
a-woodchuck-chuck-if-a-woodchuck-could-chuck-wood?" is "h|o|w|-|m|u|c|h|-|w|o| 
o|d|-wo|u|l|d-|a|-wood|ch|uc|k|-|chuck-|i|f|-a-woodchuck-c|ould-|chuck-|wood|?" . 

Lemma 1 (Charikar et al., 2002; Rytter, 2002). The number of phrases 
in the LZ77 parse is a lower bound on the size of the smallest grammar. 

As an aside, we note that Lemma 1 and results from our previous paper [1]
together imply we cannot compute the LZ77 parse with one stream when the 
product of the memory and passes is sublinear in n. 

It is not difficult to show that this LZ77 parse — like the original — can be 
computed with read-only access to the input and logarithmic workspace. Pseudocode
for doing this appears as Algorithm 1. On the example above, this pseudocode
produces "how-much-wood(9,3)ul(13,2)a(9,5)(7,2)(6,2)k-(27,6)if(20,14)
(16,5)(27,6)(10,4)?".

Lemma 2. We can compute the LZ77 parse with logarithmic workspace.

Proof. The first phrase in the parse is the first letter in s; after outputting this,
we always keep a pointer t to the division between the prefix already parsed and
the suffix yet to be parsed. To compute each later phrase in turn, we check the
length of the longest common prefix of s[i..t − 1] and s[t..n], for 1 ≤ i < t; if
the longest match has length 0 or 1, we output s[t]; otherwise, we output the
value of the minimal i that maximizes the length of the longest common prefix,
together with that length. This takes a constant number of pointers into s and
a constant number of O(log n)-bit counters. □
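
The procedure in the proof translates almost line by line into runnable code. The following Python sketch is illustrative only: the names `lz77_parse` and `format_parse` are ours, positions are reported 1-based to match the text, and the output list stands in for the stream to which the parse would be written (so it does not count as workspace). Apart from that list, the sketch keeps only a constant number of integer indices into s.

```python
def lz77_parse(s):
    """Non-self-referencing LZ77 parse of s.

    Returns a list whose items are single characters or (i, length)
    pairs, where i is the 1-based start of the earliest longest match.
    """
    n = len(s)
    phrases = []
    t = 0  # 0-based boundary: s[:t] is parsed, s[t:] is not
    while t < n:
        max_match, max_length = 0, 0
        for i in range(t):
            j = 0
            # i + j < t keeps the match inside the parsed prefix
            while i + j < t and t + j < n and s[i + j] == s[t + j]:
                j += 1
            if j > max_length:  # strict '>' keeps the minimal i
                max_match, max_length = i, j
        if max_length <= 1:
            phrases.append(s[t])
            t += 1
        else:
            phrases.append((max_match + 1, max_length))
            t += max_length
    return phrases


def format_parse(phrases):
    """Render a parse as literals and (position,length) pairs."""
    return "".join(p if isinstance(p, str) else "(%d,%d)" % p for p in phrases)


example = ("how-much-wood-would-a-woodchuck-chuck-if-"
           "a-woodchuck-could-chuck-wood?")
print(format_parse(lz77_parse(example)))
```

The sketch takes cubic time in the worst case; it is meant to mirror the workspace argument, not to be fast.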

Combined with Chen and Yap's theorem below, Lemma 2 implies that we
can compute the LZ77 parse with constant workspace and logarithmic passes 
over a constant number of streams. 

Theorem 1 (Chen and Yap, 1991). If a function can be computed with log- 
arithmic workspace, then it can be computed with constant workspace and loga- 
rithmic passes over a constant number of streams. 

As an aside, we note that Chen and Yap's theorem is actually much stronger 
than what we state here: they proved that, if f(n) = Ω(log n) is reversal-computable
(see [24] or [23] for an explanation) and a problem can be solved
deterministically in f(n) space, then it can be solved with constant workspace
and O(f(n)) passes over a constant number of tapes. Chen and Yap showed
how a reversal-bounded Turing machine can simulate a space-bounded Turing 
machine by building a table of the possible configurations of the space-bounded 
machine. Schweikardt [35] pointed out that "this is of no practical use, since 
the resulting algorithm produces huge intermediate results, but it is of major
theoretical interest" because it implies that a number of lower bounds are tight.
We leave as future work finding a more practical proof of our results.

    t ← 1;
    while t ≤ n do
        max_match ← 0;
        max_length ← 0;
        for i ← 1 … t − 1 do
            j ← 0;
            while i + j < t and t + j ≤ n and s[i + j] = s[t + j] do
                j ← j + 1;
            if j > max_length then
                max_match ← i;
                max_length ← j;
        if max_length ≤ 1 then
            print s[t];
            t ← t + 1;
        else
            print (max_match, max_length);
            t ← t + max_length;

Algorithm 1: pseudocode for computing the LZ77 parse in logarithmic
workspace

In the next section we prove the following lemma, which is the most technical 
part of this paper. By the size of a CFG, we mean the number of symbols on the 
righthand sides of the productions; notice this is at most a logarithmic factor 
less than the number of bits needed to express the CFG. 

Lemma 3. With logarithmic workspace we can turn the LZ77 parse into a CFG
whose size is within an $O(\min(g \log g, \sqrt{n/\log n}))$-factor of minimum.

Together with Lemma 2 and Theorem 1, Lemma 3 immediately implies our main
result.

Theorem 2. With constant workspace and logarithmic passes over a constant
number of streams, we can build a CFG generating s and only s whose size is
within an $O(\min(g \log g, \sqrt{n/\log n}))$-factor of minimum.

3 Logspace CFG Construction 

Unlike the LZ78 parse, the LZ77 parse cannot normally be viewed as a CFG, 
because the substring to which a phrase matches may begin or end in the mid- 
dle of a preceding phrase. We note this obstacle has been considered by other 
authors in other circumstances, e.g., by Navarro and Raffinot [36] for pattern
matching. Fortunately, we can remove this obstacle in logarithmic workspace,
without increasing the number of phrases more than quadratically. To do this,
for each phrase for which we output (i, ℓ), we ensure s[i] is the first character in
a phrase and s[i + ℓ − 1] is the last character in a phrase (by breaking phrases in
two, if necessary). For example, the parse

    h|o|w|-|m|u|c|h|-|w|o|o|d|-wo|u|l|d-|a|-wood|ch|uc|k|-|chuck-|i|f|-a-woodchuck-c|ould-|chuck-|wood|?

becomes

    h|o|w|-|m|u|c|h|-|w|o|o|d|-w|_1 o|u|l|d|_2 -|a|-wood|c|_3 h|uc|k|-|c|_4^3 huck-|i|f|^2 -a-woodchuck-c|^{1,4} ould-|chuck-|wood|?

where the subscripted breaks are new and superscripts indicate which breaks
cause the new ones. Notice the break "a-woodchuck-c|^{1,4} ould" causes both
"w|_1 ould" (matching "ould") and "a-woodchuck-c|_4^3 huck"
(matching "a-woodchuck-c"); in turn, the latter new break causes "woodc|_3 huck"
(matching "huck"), which is why it has a superscript 3.

Lemma 4. Breaking the phrases takes at most logarithmic workspace and at
most squares the number of phrases. Afterwards, every phrase is either a single
character or the concatenation of complete, consecutive, preceding phrases.

Proof. Since the phrases' start points are the partial sums of their lengths, we
can compute them with logarithmic workspace; therefore, we can assume without
loss of generality that the start points are stored with the phrases. We start with
the rightmost phrase and work left. For each phrase's endpoints, we compute
the corresponding position in the matching, preceding substring (notice that the
position corresponding to one phrase's finish may not be the one corresponding to
the start of the next phrase to the right) and insert a new break there, if there is
not one already. If we have inserted a new break, then we iterate, computing the
position corresponding to the new break; eventually, we will reach a point where
there is already a break, so the iteration will stop. This process requires only a
constant number of pointers, so we can perform it with logarithmic workspace.
Also, since each phrase is broken at most twice for each of the phrases that
initially follow it in the parse, the final number of phrases is at most the square
of the initial number. By inspection, after the process is complete every phrase
is the concatenation of complete, consecutive, preceding phrases. □

Notice that, after we break the phrases as described above, we can view the
parse as a CFG. For example, the parse for our running example corresponds to

    X_0 → X_1 ... X_35
    X_1 → h
    ...
    X_13 → d
    ...
    X_31 → X_19 ... X_27
    ...
    X_34 → X_10 ... X_13
    X_35 → ?

where X_0 is the starting nonterminal. Unfortunately, while the number of productions
is polynomial in the number of phrases in the LZ77 parse, it is not
clear the size is and, moreover, the grammar is not in CNF. Since all the righthand
sides of the productions are either terminals or sequences of consecutive
nonterminals, we could put the grammar into CNF by squaring the number of 
nonterminals — giving us an approximation ratio cubic in g. This would still be 
enough for us to prove our main result but, fortunately, such a large increase is 
not necessary. 
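
The phrase-breaking step and the resulting grammar can be tried out on the running example. The Python sketch below is a hedged illustration, not the logspace procedure itself: it keeps the set of breaks in memory, the names `break_phrases` and `grammar` and the hard-coded `PARSE` are ours, and the right-to-left sweep over the current pieces is one way to realise the iteration described in the proof of Lemma 4.

```python
from bisect import bisect_right

S = ("how-much-wood-would-a-woodchuck-chuck-if-"
     "a-woodchuck-could-chuck-wood?")
# LZ77 parse of S from the running example: literals and
# 1-based (source, length) pairs.
PARSE = list("how-much-wood") + [
    (9, 3), "u", "l", (13, 2), "a", (9, 5), (7, 2), (6, 2), "k", "-",
    (27, 6), "i", "f", (20, 14), (16, 5), (27, 6), (10, 4), "?"]


def break_phrases(parse):
    """Return (starts, breaks, source): the original phrase starts, the
    sorted break positions after refinement (1-based, with n + 1 as a
    sentinel), and a map from each copied phrase's start to its source."""
    starts, source, pos = [], {}, 1
    for tok in parse:
        starts.append(pos)
        if isinstance(tok, tuple):
            source[pos] = tok[0]
            pos += tok[1]
        else:
            pos += 1
    breaks = set(starts) | {pos}
    end = pos
    while end > 1:  # sweep the current pieces from right to left
        start = max(b for b in breaks if b < end)
        if end - start >= 2:  # a piece of a copied phrase
            a = starts[bisect_right(starts, start) - 1]
            shift = a - source[a]
            # both endpoints of the piece must be breaks in the source
            breaks.update((start - shift, end - shift))
        end = start
    return starts, sorted(breaks), source


def grammar(parse, s):
    """Productions of the refined parse: X_k maps to a character or to a
    pair (p, q) standing for X_p ... X_q; X_0 is the start symbol."""
    starts, breaks, source = break_phrases(parse)
    number = {b: k for k, b in enumerate(breaks[:-1], start=1)}
    prods = {0: (1, len(number))}
    for k, (start, end) in enumerate(zip(breaks, breaks[1:]), start=1):
        if end - start == 1:
            prods[k] = s[start - 1]
        else:
            a = starts[bisect_right(starts, start) - 1]
            shift = a - source[a]
            prods[k] = (number[start - shift], number[end - shift] - 1)
    return prods


g = grammar(PARSE, S)
print(len(g) - 1, g[31], g[34])
```

On the running example this yields 35 phrases, with four new breaks, and X_31 deriving X_19 ... X_27, in agreement with the productions that reappear in the diagram for Lemma 5.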

Lemma 5. Putting the CFG into CNF takes logarithmic workspace and in- 
creases the number of productions by at most a logarithmic factor. Afterwards, 
the size of the grammar is proportional to the number of productions. 

Proof. We build a forest of complete binary trees whose leaves are the nontermi- 
nals: if we consider the trees in order by size, the nonterminals appear in order 
from the leftmost leaf of the first tree to the rightmost leaf of the last tree; each 
tree is as large as possible, given the number of nonterminals remaining after 
we build the trees to its left. Notice there are O(log g) such trees, of total size
at most $O(g^2)$. We then assign a new nonterminal to each internal node and
output a production which takes that nonterminal to its children. This takes 
logarithmic workspace and increases the number of productions by a constant 
factor. 

Notice any sequence of consecutive nonterminals that spans at least two trees, 
can be written as the concatenation of two consecutive sequences, one of which 
ends with the rightmost leaf in one tree and the other of which starts with the 
leftmost leaf in the next tree. Consider a sequence ending with the rightmost leaf 
in a tree; dealing with one that starts with a leftmost leaf is symmetric. If the 
sequence completely contains that tree, we can write a binary production that 
splits the sequence into the prefix in the preceding trees, which is the expansion 
of a new nonterminal, and the leaves in that tree, which are the expansion of 
its root. We need do this O(log g) times before the remaining subsequence is
contained within a single tree. After that, we repeatedly produce new binary
productions that split the subsequence into prefixes, again the expansions of
new nonterminals, and suffixes, the expansions of roots of the largest possible
complete subtree. Since the size of the largest possible complete subtree shrinks
by a factor of two at each step (or, equivalently, the height of its root decreases
by 1), we need repeat O(log g) times. Again, this takes logarithmic workspace
(we will give more details in the full version of this paper).

In summary, we may replace each production with O(log g) new, binary productions.
Since the productions are binary, the number of symbols on the righthand
sides is linear in the number of productions themselves. □
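
The construction in the proof can be sketched in a few lines. In the Python below, which is an illustration under our own naming (`canonical_blocks`, `forest` and `chain` are hypothetical helpers), a pair (a, b) stands for the nonterminal X_{a,b} and (a, a) for the original X_a. As a simplification, `chain` peels tree nodes off the right end of a sequence one at a time instead of first splitting the sequence at a tree boundary as the proof does; the number of new productions per sequence is still logarithmic.

```python
def canonical_blocks(p, q):
    """Split the 1-based range p..q into maximal aligned power-of-two
    blocks, i.e. into nodes of the forest of complete binary trees."""
    blocks, lo, hi = [], p - 1, q  # 0-based, half-open
    while lo < hi:
        # largest power of two dividing lo (unbounded when lo == 0)
        size = lo & -lo if lo else 1 << hi.bit_length()
        while size > hi - lo:
            size >>= 1
        blocks.append((lo + 1, lo + size))
        lo += size
    return blocks


def forest(m):
    """Binary productions for the internal nodes of the trees over X_1..X_m."""
    prods, stack = {}, canonical_blocks(1, m)
    while stack:
        a, b = stack.pop()
        if a < b:
            mid = (a + b) // 2
            prods[(a, b)] = ((a, mid), (mid + 1, b))
            stack += [(a, mid), (mid + 1, b)]
    return prods


def chain(p, q):
    """Extra binary productions so that (p, q) derives X_p ... X_q."""
    blocks, prods = canonical_blocks(p, q), {}
    while len(blocks) > 1:
        last = blocks.pop()
        # prefix nonterminal followed by the root of a complete subtree
        prods[(p, last[1])] = ((p, last[0] - 1), last)
    return prods


print(canonical_blocks(19, 27))
print(chain(10, 12))
```

For the running example with 35 nonterminals, `canonical_blocks(1, 35)` gives trees with 32, 2 and 1 leaves, and `chain(10, 12)` produces the single production taking X_{10,12} to X_10 X_{11,12}.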

Lemma 5 is our most detailed result, and the diagram below showing the
construction with our running example is also somewhat detailed. On the left
are modifications of the original productions, now made binary; in the middle
are productions for the internal nodes of the binary trees; and on the right are
productions breaking down the consecutive subsequences that appear on the
righthand sides of the productions in the left column, until the subsequences are
single, original nonterminals or nonterminals for nodes in the binary trees (i.e.,
those on the lefthand sides of the productions in the middle column).



    X_0  → X_{1,32} X_{33,35}      X_{1,32}  → X_{1,16} X_{17,32}      X_{19,24} → X_{19,20} X_{21,24}
    X_1  → h                       X_{1,16}  → X_{1,8} X_{9,16}        X_{25,27} → X_{25,26} X_27
    ...                            X_{1,8}   → X_{1,4} X_{5,8}         X_{10,12} → X_10 X_{11,12}
    X_13 → d                       ...                                 X_{33,35} → X_{33,34} X_35
    X_14 → X_9 X_10                X_{17,32} → X_{17,24} X_{25,32}
    X_15 → o                       X_{17,24} → X_{17,20} X_{21,24}
    ...                            ...
    X_31 → X_{19,24} X_{25,27}     X_{29,32} → X_{29,30} X_{31,32}
    ...                            ...
    X_34 → X_{10,12} X_13          X_{1,2}   → X_1 X_2
    X_35 → ?                       ...
                                   X_{33,34} → X_33 X_34



Combined with Lemma 1, Lemmas 4 and 5 imply that with logarithmic
workspace we can build a CFG in CNF whose size is $O(g^2 \log g)$. We can use
a similar approach with binary trees to build a CFG in CNF of size O(n) that
generates s and only s, still using logarithmic workspace. If we combine all nonterminals
that have the same expansion, which also takes logarithmic workspace,
then this becomes Kieffer, Yang, Nelson and Cosman's [37] Bisection algorithm,
which gives an $O(\sqrt{n/\log n})$-approximation [8]. By taking the smaller
of these two CFGs we achieve an $O(\min(g \log g, \sqrt{n/\log n}))$-approximation.
Therefore, as we claimed in Lemma 3, with logarithmic workspace we can turn
the LZ77 parse into a CFG whose size is within an $O(\min(g \log g, \sqrt{n/\log n}))$-factor
of minimum.



4 Recent Work 

We recently improved the bound on the approximation ratio in Lemma 3 from
$O(\min(g \log g, \sqrt{n/\log n}))$ to $O(\min(g, \sqrt{n/\log n}))$. The key observation is
that, by the definition of the LZ77 parse, the first occurrence of any substring
must touch or cross a break between phrases. Consider any phrase in the parse
obtained by applying Lemma 4 to the LZ77 parse. By the observation above, that
phrase can be written as the concatenation of some consecutive new phrases (all
contained within one old phrase and ending at that old phrase's right end), some
consecutive old phrases, and some more consecutive new phrases (all contained
within one old phrase and starting at the old phrase's left end). Since there
are O(g) old phrases, there are $O(g^2)$ sequences of consecutive old phrases;
since there are $O(g^2)$ new phrases, there are $O(g^2)$ sequences of consecutive
new phrases that are contained in one old phrase and either end at that old
phrase's right end or start at that old phrase's left end.

While working on the improvement above, we realized how to improve the
bound further, to $O(\min(g, 4^{\sqrt{\log n}}))$. To do this, we choose a value b between
2 and n and, for $0 \le i \le \log_b n$, we associate a nonterminal to each of the b
blocks of $\lceil n/b^i \rceil$ characters to the left and right of each break; we thus start
building the grammar with $O(b g \log_b n)$ nonterminals. We then add $O(b g \log_b n)$
binary productions such that any sequence of nonterminals associated with a
consecutive sequence of blocks can be derived from O(1) nonterminals. Notice
any substring is the concatenation of 0 or 1 partial blocks, some number of full
blocks to the left of a break, some number of blocks to the right of a break, and
0 or 1 more partial blocks. We now add more binary productions as follows: we
start with s (the only block of length $\lceil n/b^0 \rceil = n$); find the first break it touches
or crosses (in this case it is the start of s); consider s as the concatenation
of blocks of size $\lceil n/b^1 \rceil$ (in this case only the rightmost block can be partial);
associate nonterminals to the partial blocks (if they exist); add O(1) productions
to take the symbol associated to s (in this case, the start symbol) to the sequence
of nonterminals associated with the smaller blocks in order from left to right; and
recurse on each of the smaller blocks. To guarantee each smaller block touches or
crosses a break, we work on the first occurrence in s of the substring contained in
that block. We stop recursing when the block size is 1, and add O(bg) productions
taking those blocks' nonterminals to the appropriate characters.

Analysis shows that the number of productions we add during the recursion
is proportional to the number of blocks involved, either full or partial. Since
the number of distinct full blocks in any level of recursion is O(bg) and the
number of partial blocks is at most twice the number of blocks (full or partial)
in the previous level of recursion, the number of productions we add during the
recursion is $O(2^{\log_b n} b g)$. Therefore, the grammar has size $O(2^{\log_b n} b g)$; when
$b = 2^{\sqrt{\log n}}$, this is $O(4^{\sqrt{\log n}} g)$. The first of the two key observations that let
us build the grammar in logarithmic workspace is that we can store the index
of a block (full or partial) with respect to the associated break in $O(\sqrt{\log n})$
bits; therefore, we can store O(1) indices for each of the $O(\sqrt{\log n})$ levels of
recursion in a total of O(log n) bits. The second key observation is that, given
the indices of the block we are working on in each level of recursion, with respect
to the appropriate break, we can compute the start point and end point of the
block we are currently working on in the deepest level of recursion. We will give
details of these two improvements in the full version of this paper.
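
The choice of b and the workspace bound can be checked by direct substitution. The worked computation below is a sketch that assumes all logarithms are base 2, which the text does not state explicitly.

```latex
\begin{align*}
b = 2^{\sqrt{\log n}}
  &\;\Longrightarrow\;
  \log_b n = \frac{\log n}{\log b} = \frac{\log n}{\sqrt{\log n}} = \sqrt{\log n},\\
2^{\log_b n}\, b\, g
  &= 2^{\sqrt{\log n}} \cdot 2^{\sqrt{\log n}} \cdot g
   = 4^{\sqrt{\log n}}\, g,\\
\underbrace{O(\log_b n)}_{\text{levels}} \cdot
\underbrace{O(\log b)}_{\text{bits per index}}
  &= O\!\left(\sqrt{\log n} \cdot \sqrt{\log n}\right) = O(\log n).
\end{align*}
```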

While working on this second improvement, we realized that we can use the
same ideas to build a compressed representation that allows efficient random
access. We refer the reader to the recent papers by Kreft and Navarro [38] and
Bille, Landau and Weimann [39] for background on this problem. Suppose that,
for each of the $O(2^{\sqrt{\log n}} g \sqrt{\log n})$ full blocks described above, we store a pointer
to the first occurrence in s of the substring in that block, as well as a pointer to
the first break that first occurrence touches or crosses. Notice this takes a total
of $O(2^{\sqrt{\log n}} g (\log n)^{3/2})$ bits. Then, given a block's index and an offset in that
block, in O(1) time we can compute a smaller block's index and offset in that
smaller block, such that the characters in those two positions are equal; if the
larger block has size 1, in O(1) time we can return the character. Since s itself
is a block, an offset in it is just a character's position, and there are $O(\sqrt{\log n})$
levels of recursion, it follows that we can access any character in $O(\sqrt{\log n})$ time.
Further analysis shows that it takes $O(\sqrt{\log n} + \ell)$ time to access a substring of
length $\ell$. Of course, for any positive constant $\epsilon$, if we are willing to use $O(n^{\epsilon} g)$
bits of space, then we can access any character in constant time. If we make the
data structure slightly larger and more complicated (e.g., storing searchable
partial sums at the block boundaries) then, at least for strings over fairly small
alphabets, we can also support fast rank and select queries.

This implementation makes it easy to see the data structure's relation to 
LZ77 and grammar-based compression. We can use a simpler implementation, 
however, and use LZ77 only in the analysis. Suppose that, for $0 \le i \le \log_b n$, we
break s into consecutive blocks of length $\lceil n/b^i \rceil$ (the last block may be shorter),
always starting from the first character of s. For each block, we store a pointer to
the first occurrence in s of that block's substring. Given a block's index and an
offset in that block, in O(1) time we can again compute a smaller block's index
and offset in that smaller block, such that the characters in those two positions
are equal: the new block's index is the sum of the pointer and the old offset, divided
by the new block length and rounded down; the new offset is the sum of the 
pointer and the old offset, modulo the new block length. We can discard any 
block that cannot be visited during a query, so this data structure takes at most 
a constant factor more space than the one described above. Indeed, this data 
structure seems likely to be smaller in practice, because blocks of the same size 
can overlap in the previous data structure but cannot in this one. We plan to 
implement this data structure and report the results in a future paper. 
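
The simpler implementation can be prototyped directly from this description. The Python sketch below is only an illustration under stated assumptions: `build` and `access` are our names, positions are 0-based, first occurrences are found naively with `str.find`, and no attempt is made to match the space bounds above. Blocks that no query can visit are never created, as suggested in the text.

```python
def build(s, b=2):
    """Return a list of levels (block_length, table) for a non-empty s.

    In every level but the last, table maps a reachable block index to
    the first occurrence in s of that block's substring; in the last
    level (block length 1) it maps the index to the character itself.
    """
    n = len(s)
    lengths = [n]
    while lengths[-1] > 1:
        lengths.append(-(-n // b ** len(lengths)))  # ceil(n / b^i)
    levels, reachable = [], {0}
    for depth, length in enumerate(lengths):
        table, nxt = {}, set()
        for k in sorted(reachable):
            block = s[k * length:(k + 1) * length]
            if length == 1:
                table[k] = block
            else:
                first = s.find(block)
                table[k] = first
                step = lengths[depth + 1]
                # smaller blocks overlapping the first occurrence
                nxt.update(range(first // step,
                                 (first + len(block) - 1) // step + 1))
        levels.append((length, table))
        reachable = nxt
    return levels


def access(levels, pos):
    """Return s[pos] by following one pointer per level."""
    index, offset = 0, pos
    for (_, table), (step, _) in zip(levels, levels[1:]):
        # pointer plus old offset, split by the new block length
        index, offset = divmod(table[index] + offset, step)
    return levels[-1][1][index]


text = ("how-much-wood-would-a-woodchuck-chuck-if-"
        "a-woodchuck-could-chuck-wood?")
levels = build(text, b=3)
print("".join(access(levels, i) for i in range(len(text))))
```

Each query performs one `divmod` per level, so its cost is proportional to the number of levels, which is about log_b n.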